The STRING database is a protein-protein interaction prediction database that draws on several data sources we do not have access to. Until we have access to those sources, we have to make do with the reduced information that can be extracted from the publicly available files on the STRING website's downloads page. Specifically, we will use the detailed predictions file for Homo sapiens:



In [3]:

    
cd ../../









    



/home/gavin/Documents/MRes



In [4]:

    
!mkdir string



In [3]:

    
cd string/









    



/home/gavin/Documents/MRes/string



In [6]:

    
!wget http://string-db.org/newstring_download/protein.links.detailed.v9.1/9606.protein.links.detailed.v9.1.txt.gz









    



--2014-07-04 15:17:21--  http://string-db.org/newstring_download/protein.links.detailed.v9.1/9606.protein.links.detailed.v9.1.txt.gz
Resolving string-db.org (string-db.org)... 194.94.44.34
Connecting to string-db.org (string-db.org)|194.94.44.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37667622 (36M) [application/x-gzip]
Saving to: ‘9606.protein.links.detailed.v9.1.txt.gz’

100%[======================================>] 37,667,622  25.5MB/s   in 1.4s   

2014-07-04 15:17:23 (25.5 MB/s) - ‘9606.protein.links.detailed.v9.1.txt.gz’ saved [37667622/37667622]



In [8]:

    
!gunzip 9606.protein.links.detailed.v9.1.txt.gz



In [12]:

    
!head 9606.protein.links.detailed.v9.1.txt









    



protein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score
9606.ENSP00000000233 9606.ENSP00000020673 0 0 0 0 0 0 176 176
9606.ENSP00000000233 9606.ENSP00000054666 0 0 0 0 88 0 309 327
9606.ENSP00000000233 9606.ENSP00000158762 0 0 0 0 0 0 718 718
9606.ENSP00000000233 9606.ENSP00000203407 0 0 0 272 0 0 0 272
9606.ENSP00000000233 9606.ENSP00000203630 0 0 0 241 0 0 0 241
9606.ENSP00000000233 9606.ENSP00000215071 0 0 0 130 0 0 105 170
9606.ENSP00000000233 9606.ENSP00000215115 0 0 0 196 0 0 0 196
9606.ENSP00000000233 9606.ENSP00000215375 0 0 0 279 0 0 0 279
9606.ENSP00000000233 9606.ENSP00000215565 0 0 0 151 0 0 0 151
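
As a quick illustration of the layout shown above, the following minimal sketch splits one of the printed rows into its two protein identifiers and its eight score columns. The variable names are illustrative only; the row itself is copied from the output above, and `9606` is the NCBI taxonomy ID for Homo sapiens.

```python
# A sample row copied from the `head` output above.
row = "9606.ENSP00000000233 9606.ENSP00000054666 0 0 0 0 88 0 309 327"

fields = row.split(" ")

# Strip the taxonomy prefix ("9606.") from each identifier.
protein1 = fields[0].split(".", 1)[1]
protein2 = fields[1].split(".", 1)[1]

# The remaining eight columns are the seven evidence scores
# followed by the combined score.
scores = [int(x) for x in fields[2:]]

print("%s %s %s" % (protein1, protein2, scores))
```

Converting the scores to integers here is a choice made for the illustration; the notebook cells below keep them as the strings read from the file.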

By inspection we can see that the protein identifiers here are Ensembl protein IDs, each prefixed with the taxonomy ID 9606. In the InterologWalk Notebook a dictionary was saved to map between our Entrez IDs and these IDs. We can reuse this dictionary:



In [4]:

    
cd ../geneconversion/









    



/home/gavin/Documents/MRes/geneconversion



In [5]:

    
import pickle



In [6]:

    
# open in binary mode, which pickle expects
with open("human.gene2ensemble.pickle", "rb") as f:
    gene2ensembl = pickle.load(f)

To map from the above Ensembl IDs to Entrez IDs the dictionary will have to be inverted:



In [7]:

    
ensembl2gene = {}
for k in gene2ensembl:
    for p in gene2ensembl[k]:
        # handle each protein ID on its own, so that a missing key for one
        # protein never overwrites genes already stored for another
        ensembl2gene.setdefault(p, []).append(k)
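
To see what inverting a one-to-many mapping produces, here is a small self-contained sketch on a toy dictionary. The helper name `invert_one_to_many` and the IDs `geneA`, `geneB`, `ENSP1`, `ENSP2` are made up for the example and do not come from the real pickle.

```python
from collections import defaultdict


def invert_one_to_many(mapping):
    """Invert {key: [values]} into {value: [keys]}."""
    inverted = defaultdict(list)
    for key, values in mapping.items():
        for value in values:
            inverted[value].append(key)
    return dict(inverted)


# Toy data: two genes share the protein ID "ENSP2".
toy_gene2ensembl = {"geneA": ["ENSP1", "ENSP2"], "geneB": ["ENSP2"]}
toy_ensembl2gene = invert_one_to_many(toy_gene2ensembl)
```

In this toy case `ENSP2` ends up mapped to both genes, which is the situation that makes the inverted dictionary one-to-many as well.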

What we would like to do is create a class that stores each protein pair as a key and the corresponding feature vector as its value. If it is unable to retrieve a feature vector for a pair, it should return an empty (all-zero) vector, as that corresponds to every evidence term being zero.

Because the mapping is not one-to-one, a single Ensembl ID may map to several Entrez IDs. To ensure coverage, every combination of the Entrez IDs found for the two proteins has to map to the same feature vector.
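
The behaviour described above can be sketched with a small lookup class. This is only an illustration under stated assumptions: `PairFeatures` is a hypothetical stand-in written for this example, not the `ocbio.ppipred.features` class used later, whose interface is not shown here. It assumes pairs are stored under `frozenset` keys and that a missing pair should yield a vector of zeros.

```python
class PairFeatures(object):
    """Hypothetical lookup table from unordered ID pairs to feature vectors."""

    def __init__(self, table, n_features):
        # table: dict mapping frozenset({id1, id2}) -> list of scores
        self.table = table
        # Vector returned when a pair has no entry: every score is zero.
        self.default = [0] * n_features

    def __getitem__(self, pair):
        # frozenset makes the lookup independent of the order of the pair.
        return self.table.get(frozenset(pair), list(self.default))


# Toy Entrez-style IDs, invented for the example.
toy_table = {frozenset(["1", "2"]): [0, 0, 0, 0, 88, 0, 309, 327]}
features = PairFeatures(toy_table, n_features=8)

known = features[("2", "1")]    # same entry as ("1", "2")
missing = features[("1", "3")]  # no entry, so the zero vector is returned
```

Using `frozenset` keys means the pair only needs to be stored once, whichever order the two IDs are later queried in.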



In [8]:

    
cd ../string/









    



/home/gavin/Documents/MRes/string



In [9]:

    
import csv



In [10]:

    
import itertools



In [11]:

    
stringdict = {}
with open("9606.protein.links.detailed.v9.1.txt") as f:
    c = csv.reader(f, delimiter=" ")
    next(c)  # skip the header row
    # iterate over rows building dictionary:
    for l in c:
        # first build the (possibly various) keys
        try:
            geneids1 = ensembl2gene[l[0].split(".")[1]]
            geneids2 = ensembl2gene[l[1].split(".")[1]]
        except KeyError:
            # give up on pair if they can't be mapped to Entrez
            continue
        # then iterate over their combinations, saving the feature vector under each key
        for i1, i2 in itertools.product(geneids1, geneids2):
            stringdict[frozenset([i1, i2])] = l[2:]
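
The fan-out performed by `itertools.product` in the cell above can be seen on a toy row. The Entrez-style IDs and the score vector below are invented for the illustration; only the pattern matches the cell.

```python
import itertools

# Suppose the first protein maps to two Entrez IDs and the second to one.
toy_geneids1 = ["10", "11"]
toy_geneids2 = ["20"]
toy_vector = ["0", "0", "0", "0", "88", "0", "309", "327"]

toy_dict = {}
for i1, i2 in itertools.product(toy_geneids1, toy_geneids2):
    # Every combination is stored under an order-independent key
    # and points at the same feature vector.
    toy_dict[frozenset([i1, i2])] = toy_vector

pairs = sorted(tuple(sorted(k)) for k in toy_dict)
```

Here one row yields two entries, `("10", "20")` and `("11", "20")`. Note that plain dictionary assignment means a later row producing the same key would replace the earlier vector.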



In [12]:

    
import sys



In [13]:

    
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")



In [14]:

    
import ocbio.ppipred



In [15]:

    
strfeatures = ocbio.ppipred.features(stringdict, next(iter(stringdict.values())))



In [16]:

    
import pickle



In [18]:

    
f = open("human.Entrez.string.pickle","wb")
pickle.dump(strfeatures,f)
f.close()